Understanding the Logical and Semantic Structure of Large Documents

Authors

Muhammad Mahbubur Rahman

Timothy W. Finin

Abstract

Current language understanding approaches focus on small documents, such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents like legal briefs, proposals, technical manuals and research articles is still a challenging task. We describe a framework that can analyze a large document and help people locate a particular piece of information in that document. We aim to automatically identify and classify semantic sections of documents and assign consistent and human-understandable labels to similar sections across documents. A key contribution of our research is modeling the logical and semantic structure of an electronic document. We apply machine learning techniques, including deep learning, in our prototype system. We also make available a dataset of information about a collection of scholarly articles from the arXiv e-prints collection that includes a wide range of metadata for each article, including a table of contents, section labels, section summarizations and more. We hope that this dataset will be a useful resource for the machine learning and NLP communities in information retrieval, content-based question answering and language modeling.

Similar resources

A New Text-Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need effective ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo and Bing need to read a great many web documents and retrieve the ones most similar to the user ...

Modern literary interpretation in understanding the meaning of the verse ‘There is nothing like Him’

Commentators and writers have expressed numerous views about the literary aspect and the meaning of the Qur'anic phrase "There is nothing like Him". The sequence of the words "ka" and "like" in the holy verse has led to two illusions, one literary and one semantic. The literary illusion is that "ka" seems to be redundant, and the semantic illusion is that the word ‘like’ indirectly proves the...

Semantic Indexing of Technical Documentation

This research takes place in an industrial context: the CONTINEW Company. This company ensures the storage and security of critical data and technical documentation. Consequently, it is necessary to organize these documents so that critical information can be retrieved quickly. Managing this increasing volume of documents requires document classification, which is based on indexing techniqu...

A Document Reuse Tool for Communities of Practice

With the rise of the Internet, virtual communities of practice are gaining importance as a means of sharing and exchanging information. In such environments, information reuse is a major concern. In this paper, we outline the importance of enriching documents with structural and semantic information in order to facilitate their reuse. We propose a framework for document reuse based on an explic...

A Model for Conformance Analysis of Software Documents

During the evolution of a large-scale software project, developers produce a large variety of software artifacts such as requirement specifications, design documents, source code, documentation and bug reports. These software documents are not isolated items; they are semantically related to each other. They evolve over time and the set of active semantic relationships among them is also dyn...

Journal:

CoRR

Volume abs/1709.00770

Published 2017
